7 July 2018

Thanks/Acknowledgements

  • Anne Owen, Nik Lomax, LIDA
  • Janet Boutell, Jim Lewsey, University of Glasgow
  • Gerry McCartney, Jane Parkinson, NHS Health Scotland

Key ideas

  • Informal models precede formal models
  • Learning statistical modelling involves learning to say something in a rarefied language
  • But before trying to say something you must have something to say
  • Standard linear modelling approaches encourage a 'variable-based' way of thinking about the world in which effects are assumed to be independent and partitionable
  • Lexis surface: a way of looking at population data which encourages a 'case-based' way of thinking about what the data show, in which the independent/partitionable effect assumption is not 'baked in'
  • Starting with the Lexis surface allows better informal models to be developed before the statistical modelling stage

Some additional ideas

  • Quantitative research isn't about numbers, it's about patterns
  • Demography (demos: the people; -graphy: describing) is the grandmother of the social sciences
  • People are good at images but bad at numbers
  • People are good at complex gestalts; bad at linear sequence
  • The Lexis Surface (maps of age-time) allows data to be explored much as maps of space can be explored

Structure of talk

  • Gestalt examples
  • Defining a data visualisation
  • Example 1: All-cause mortality data
  • Example 2: Despair-related deaths in Scotland
  • Concluding remarks

Introduction

Examples of Gestalts

Case-based reasoning as Gestalts

Grammar of Graphics 101

Data visualisations

  • Not all graphics are data visualisations
  • Data visualisations require a consistent application of mapping rules

Mapping rules:

  • Variable in data \(\rightarrow\) graphical feature
  • Can be specified formally using the Grammar of Graphics

Examples of variables in data

  • Age
  • Year
  • Gender
  • Death rate
  • Crime rate
  • Fertility rate
  • Health scores
  • Political Attitudes

etc

Examples of Graphical features

  • Position across horizontal axis
  • Position across vertical axis
  • Colour of marks
  • Size of dot/width of line
  • Transparency
  • Whether lines are solid or dashed
  • Colour in filled areas between marks
  • Angle

etc

Example

This set of mapping rules…

Variable in dataset                    | Graphical feature
% of population providing free care   | Position along vertical axis
% of population with health needs     | Position along horizontal axis
Size of population in areal unit      | Size of circle
Whether in North or South of England  | Colour of bubble


Applied to 2001 English/Welsh Census data…

Example
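
As a rough sketch, the four mapping rules above can be written as a ggplot2 aesthetic specification, with one `aes()` argument per rule. The data frame and its column names below are invented for illustration and are not the census data.

```r
library(ggplot2)

# Toy data: values and column names are invented for illustration only
areas <- data.frame(
  pct_health_needs = c(12, 15, 18, 22, 25, 28),
  pct_free_care    = c(8, 9, 10, 12, 13, 14),
  population       = c(50000, 120000, 80000, 200000, 60000, 150000),
  region           = c("South", "South", "South", "North", "North", "North")
)

# Each aes() argument is one mapping rule: variable -> graphical feature
p <- ggplot(areas, aes(
  x      = pct_health_needs,  # position along horizontal axis
  y      = pct_free_care,     # position along vertical axis
  size   = population,        # size of circle
  colour = region             # colour of bubble
)) +
  geom_point(alpha = 0.6) +
  labs(x = "% of population with health needs",
       y = "% of population providing free care")

print(p)
```

Changing the graphic means changing a mapping rule, not redrawing by hand: swapping `colour = region` for `shape = region`, for example, keeps the data and alters only the graphical feature.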

Population data

Population data are data where:

  • Something
  • Has been recorded consistently about types of people
  • For different ages
  • and at different times

Example 1: All-cause mortality

Example of data in this format

From the Human Mortality Database

## # A tibble: 1,445,886 x 6
##    country  year   age sex    death_count population_count
##    <chr>   <int> <int> <chr>        <dbl>            <dbl>
##  1 AUS      1921     0 female      3842.            62758.
##  2 AUS      1921     1 female       586.            57766.
##  3 AUS      1921     2 female       390.            57014.
##  4 AUS      1921     3 female       254.            58307.
##  5 AUS      1921     4 female       176.            58711.
##  6 AUS      1921     5 female       146.            59875.
##  7 AUS      1921     6 female       128.            61023.
##  8 AUS      1921     7 female       112.            59465.
##  9 AUS      1921     8 female        97.0           57746.
## 10 AUS      1921     9 female        83.8           56186.
## # ... with 1,445,876 more rows

Decoding this

  • country and sex are grouping variables (i.e. categorical not cardinal)
  • year and age are numeric variables (stored as integers, treated here as continuous)
  • death_count and population_count are attributes that are specific to different combinations of country, sex, year, and age.
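
The models later in the talk use a `death_rate` variable that is not among the six columns shown. A minimal sketch of how it might be derived, assuming it is simply deaths divided by population at risk (this definition is an assumption), using the first three rows as displayed (rounded):

```r
# First three rows of the extract shown above (values as displayed, rounded)
hmd <- data.frame(
  country = "AUS", year = 1921L, age = 0:2, sex = "female",
  death_count = c(3842, 586, 390),
  population_count = c(62758, 57766, 57014)
)

# Assumed definition: deaths per person in each country/sex/year/age cell
hmd$death_rate <- hmd$death_count / hmd$population_count

# The regressions model the rate on the log scale
hmd$log_death_rate <- log(hmd$death_rate)

print(hmd[, c("age", "death_rate", "log_death_rate")])
```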

1.4 million rows!

Standard ways of exploring

  • Sweeping by year: Life expectancies, crude mortality rates
  • Sweeping by age: 'Bathtub curves'
  • Conditional sweeping by year: different age groups
  • etc
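
'Sweeping' here means collapsing the age-year table along one dimension. A small base R sketch on invented counts (not HMD data) shows the idea: pooling over age gives a crude rate per year, and pooling over year gives a rate per age.

```r
# Toy data: invented counts for two years and three ages
toy <- data.frame(
  year = rep(c(2000, 2001), each = 3),
  age = rep(c(0, 40, 80), times = 2),
  death_count = c(50, 20, 300, 45, 18, 310),
  population_count = c(10000, 12000, 3000, 10000, 12000, 3100)
)

# Sweep by year: total deaths / total population in each year
by_year <- aggregate(cbind(death_count, population_count) ~ year,
                     data = toy, FUN = sum)
by_year$crude_rate <- by_year$death_count / by_year$population_count

# Sweep by age: pool the years, leaving one rate per age
by_age <- aggregate(cbind(death_count, population_count) ~ age,
                    data = toy, FUN = sum)
by_age$rate <- by_age$death_count / by_age$population_count

print(by_year)
print(by_age)
```

In this toy table the rate is higher at ages 0 and 80 than at age 40, which is the 'bathtub' shape in miniature. Each sweep discards the dimension it pools over.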

Infant mortality

Infant mortality

Other ages

Relationship with age

Bathtub, Scotland, all years

Bathtub, Scotland, all years, by gender

A variable-based approach (most quantitative research)

  • Simple linear regression: Regress one variable on one variable
  • Multiple linear regression: Regress one variable on multiple variables
  • Assume the effects of explanatory variables are independent (additive) unless interactions are specified
  • (Usually) assess statistical significance of regression coefficients

An example

## 
## Call:
## lm(formula = log(death_rate) ~ sex, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5977 -1.1075 -0.1401  1.2590  3.2585 
## 
## Coefficients:
##             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -3.60459    0.02012 -179.148  < 2e-16 ***
## sexmale      0.22777    0.02845    8.005 1.35e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.326 on 8682 degrees of freedom
## Multiple R-squared:  0.007326,   Adjusted R-squared:  0.007212 
## F-statistic: 64.07 on 1 and 8682 DF,  p-value: 1.353e-15

An example

## 
## Call:
## lm(formula = log(death_rate) ~ sex + years_since_first, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62526 -0.38316  0.01859  0.40022  1.70713 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.9755770  0.0246242   39.62   <2e-16 ***
## sexmale            0.2277717  0.0120983   18.83   <2e-16 ***
## years_since_first -0.0234263  0.0001181 -198.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5637 on 8681 degrees of freedom
## Multiple R-squared:  0.8206, Adjusted R-squared:  0.8205 
## F-statistic: 1.985e+04 on 2 and 8681 DF,  p-value: < 2.2e-16

An example

## 
## Call:
## lm(formula = log(death_rate) ~ sex + poly(years_since_first, 
##     2), data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.63102 -0.25707 -0.03739  0.22951  1.37089 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -3.605e+00  5.590e-03 -644.88   <2e-16 ***
## sexmale                      2.278e-01  7.905e-03   28.81   <2e-16 ***
## poly(years_since_first, 2)1 -1.118e+02  3.683e-01 -303.59   <2e-16 ***
## poly(years_since_first, 2)2 -3.976e+01  3.683e-01 -107.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3683 on 8680 degrees of freedom
## Multiple R-squared:  0.9234, Adjusted R-squared:  0.9234 
## F-statistic: 3.488e+04 on 3 and 8680 DF,  p-value: < 2.2e-16

An example

## 
## Call:
## lm(formula = log(death_rate) ~ sex + poly(years_since_first, 
##     3), data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.53729 -0.24378 -0.03294  0.22219  1.31336 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -3.605e+00  5.488e-03 -656.77   <2e-16 ***
## sexmale                      2.278e-01  7.762e-03   29.35   <2e-16 ***
## poly(years_since_first, 3)1 -1.118e+02  3.616e-01 -309.18   <2e-16 ***
## poly(years_since_first, 3)2 -3.976e+01  3.616e-01 -109.95   <2e-16 ***
## poly(years_since_first, 3)3 -6.509e+00  3.616e-01  -18.00   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3616 on 8679 degrees of freedom
## Multiple R-squared:  0.9262, Adjusted R-squared:  0.9261 
## F-statistic: 2.722e+04 on 4 and 8679 DF,  p-value: < 2.2e-16

An example

## 
## Call:
## lm(formula = log(death_rate) ~ sex * poly(years_since_first, 
##     2), data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.62939 -0.25635 -0.03894  0.22798  1.37746 
## 
## Coefficients:
##                                       Estimate Std. Error  t value
## (Intercept)                         -3.605e+00  5.587e-03 -645.155
## sexmale                              2.278e-01  7.901e-03   28.827
## poly(years_since_first, 2)1         -1.127e+02  5.207e-01 -216.458
## poly(years_since_first, 2)2         -3.907e+01  5.207e-01  -75.039
## sexmale:poly(years_since_first, 2)1  1.769e+00  7.363e-01    2.402
## sexmale:poly(years_since_first, 2)2 -1.385e+00  7.363e-01   -1.881
##                                     Pr(>|t|)    
## (Intercept)                           <2e-16 ***
## sexmale                               <2e-16 ***
## poly(years_since_first, 2)1           <2e-16 ***
## poly(years_since_first, 2)2           <2e-16 ***
## sexmale:poly(years_since_first, 2)1   0.0163 *  
## sexmale:poly(years_since_first, 2)2   0.0600 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3682 on 8678 degrees of freedom
## Multiple R-squared:  0.9235, Adjusted R-squared:  0.9234 
## F-statistic: 2.095e+04 on 5 and 8678 DF,  p-value: < 2.2e-16

An example

## 
## Call:
## lm(formula = log(death_rate) ~ sex + is_scotland + poly(years_since_first, 
##     2), data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.63147 -0.25585 -0.03673  0.22849  1.36892 
## 
## Coefficients:
##                               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                 -3.603e+00  5.641e-03 -638.724   <2e-16 ***
## sexmale                      2.278e-01  7.903e-03   28.822   <2e-16 ***
## is_scotlandTRUE             -4.947e-02  2.124e-02   -2.329   0.0199 *  
## poly(years_since_first, 2)1 -1.119e+02  3.687e-01 -303.387   <2e-16 ***
## poly(years_since_first, 2)2 -3.982e+01  3.690e-01 -107.907   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3682 on 8679 degrees of freedom
## Multiple R-squared:  0.9235, Adjusted R-squared:  0.9234 
## F-statistic: 2.618e+04 on 4 and 8679 DF,  p-value: < 2.2e-16

An example

## Analysis of Variance Table
## 
## Model 1: log(death_rate) ~ sex
## Model 2: log(death_rate) ~ sex + years_since_first
## Model 3: log(death_rate) ~ sex + poly(years_since_first, 2)
## Model 4: log(death_rate) ~ sex + poly(years_since_first, 3)
## Model 5: log(death_rate) ~ sex * poly(years_since_first, 2)
## Model 6: log(death_rate) ~ sex + is_scotland + poly(years_since_first, 
##     2)
##   Res.Df     RSS Df Sum of Sq          F  Pr(>F)    
## 1   8682 15261.4                                    
## 2   8681  2758.5  1   12502.9 92243.6938 < 2e-16 ***
## 3   8680  1177.5  1    1581.1 11664.6875 < 2e-16 ***
## 4   8679  1135.1  1      42.4   312.5759 < 2e-16 ***
## 5   8678  1176.2  1     -41.1                       
## 6   8679  1176.8 -1      -0.5     3.8829 0.04881 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A case-based approach

  • Embrace inherent complexity
  • Interactions between factors are the norm, not the exception?
  • Imagine tens of thousands of values not as hived off into distinct variables (age, year)…
  • But forming a complex surface of values over age-time

How? By representing the surface on a map

Mapping a spatial map

Data variable | Aesthetic
Longitude     | Horizontal position
Latitude      | Vertical position
Elevation     | Colour/shade/contour lines

Mapping an age-time map (Lexis surface)

Data variable  | Aesthetic
Year           | Horizontal position
Age            | Vertical position
Mortality rate | Colour/shade/contour lines
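
These three mapping rules translate directly into a ggplot2 call. The sketch below uses a simulated surface with invented numbers, not the Human Mortality Database; the column names are assumptions chosen for illustration.

```r
library(ggplot2)

# Simulated log mortality surface: invented numbers, not HMD data
surface <- expand.grid(year = 1950:2010, age = 0:90)
surface$log_rate <- -9 + 0.08 * surface$age - 0.015 * (surface$year - 1950)

lexis <- ggplot(surface, aes(
  x    = year,      # year -> horizontal position
  y    = age,       # age  -> vertical position
  fill = log_rate   # mortality rate -> colour/shade
)) +
  geom_raster() +
  coord_fixed() +   # one year of age drawn the same length as one calendar year
  labs(x = "Year", y = "Age", fill = "log(mortality rate)")

print(lexis)
```

With `coord_fixed()`, a birth cohort ageing one year per calendar year traces a 45 degree diagonal across the surface.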

Lexis surface for Scotland

Example 2: Cause-specific mortality in Scotland

NHS Health Scotland Example

  • Three causes of death:
    • Drug related
    • Alcohol
    • Suicide
  • Two deprivation groups (Carstairs)
    • Most deprived fifth of areas
    • Less deprived four fifths of areas
  • Two genders
    • Males
    • Females

A variable-based approach

  • Intrinsic Estimator (IE)
  • Partition into Age, Period and Cohort (APC) Effects. Slices data at:
    • Age effects: 0 degrees
    • Cohort effects: 45 degrees
    • Period effects: 90 degrees
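  • Identifiability problem: cohort = period - age, so separating the three linear effects is impossible in theory without extra assumptions (models 'underspecified')
  • Constraints applied to obtain a unique solution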
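
The identifiability problem can be seen directly in a small base R sketch. This does not implement the Intrinsic Estimator; it only shows, on simulated data with invented effects, that linear age, period and cohort terms are perfectly collinear.

```r
# Small age-by-period grid; cohort is year of birth
apc <- expand.grid(age = 0:5, period = 2000:2005)
apc$cohort <- apc$period - apc$age

# Design matrix: intercept plus linear age, period and cohort terms
X <- cbind(intercept = 1, age = apc$age,
           period = apc$period, cohort = apc$cohort)
rank_X <- qr(X)$rank
cat("columns:", ncol(X), " rank:", rank_X, "\n")

# Simulated outcome (invented numbers); lm() cannot estimate all three slopes
set.seed(42)
apc$y <- 0.10 * apc$age + 0.05 * (apc$period - 2000) +
  rnorm(nrow(apc), sd = 0.01)
fit <- lm(y ~ age + period + cohort, data = apc)
print(coef(fit))  # the cohort coefficient is reported as NA
```

The design matrix has four columns but rank three, so `lm()` drops the last term. Any method that returns all three sets of effects must add a constraint of its own.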

Results from IE: Alcohol-related Age effects

Results from IE: Alcohol-related Period effects

Results from IE: Alcohol-related Cohort effects

Results from IE: Drug-related Age effects

Results from IE: Drug-related Period effects

Results from IE: Drug-related Cohort effects

Lexis surfaces for males in most deprived areas

A typology of reasoning

Mode       | Informal                | Formal
Geometric  | (1) Informal Geometric  | (2) Formal Geometric
Aetiologic | (3) Informal Aetiologic | (4) Formal Aetiologic


  • Focus here is on (1), (2), and (3), not (4)

What do the surfaces show, geometrically?

  • Alcohol-related:
    • Curve with age, peaking around age 60, then falling
    • 'Hotspot' in 2000s
  • Drug-related:
    • 'Truncated triangle'
    • Ageing-in effect (increased risk from around ages 18-25)
    • Cohort-demarcated change
    • Period effect (early 1990s)
  • Suicide:
    • 'Noisier' version of drug-related

Developing and comparing models

Findings

  • The model based on the geometric 'reading' of the Lexis surface for the Alcohol data fits the Alcohol data better than the drug-related death (DRD) or Suicide data
  • The model based on the geometric 'reading' of the DRD/Suicide Lexis surfaces fits these data better than the Alcohol data
  • Both models fit the data they are based on better than most naive, over-flexible models (not shown) designed to curve to many shapes
  • Unlike some naive, over-flexible models, the parameters in the bespoke models have some meaningful interpretations

Table of 'truncated triangle' model parameters

Parameter             | Drug-related | Suicide | Alcohol
First year of effect  | 1988         | 1987    | 1980
First age affected    | 15           | 17      | 15
Peak age              | 25           | 25      | 9
First cohort affected | 1942         | 1938    | 1961
Peak cohort           | 1968         | 1964    | 1997
Fit                   | Good         | Good    | Bad

Suggestion: A Lexis surface 'workflow'

Summary/Conclusion

Final thoughts

  • Before struggling to say something, you should have something to say
  • Your thoughts about population processes should not be limited by what you can easily express in a statistical model (i.e. don't worry about formalisation too early)
  • Think in structures, not slices
  • For making decisions, informal aetiology, guided by informal geometry, may be sufficient

Example of decisions that could be informed by above observations

  • Observation: DRD/suicide patterns closely linked
  • Observation: Both disproportionately affect males in deprived areas
    • Approximately 10-fold higher hazard than females in less deprived areas
  • Observation: Both disproportionately affect cohorts born in 1960s ('Trainspotter Generation')
  • Decision: Link mental health services to drug support services?
  • Decision: Ensure services are best located to serve those in greatest need
  • Decision: Explore further why male hazards are ~3x higher than female hazards in the same area
    • Qualitative research into gender:area interactions and experience
    • Explore how services can be adapted to more effectively engage males at high risk of suicide/DRD
  • Decision: Actively target high risk demographics?

And finally:

  • Thanks for listening!